The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged on every user irrespective of usage, while others are charged under specified circumstances.
Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers’ and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas
You as a Data scientist at Thera bank need to come up with a classification model that will help bank improve their services so that customers do not renounce their credit cards
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Libraries to tune model, get different metric score, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline
# Libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier)
from xgboost import XGBClassifier
data = pd.read_csv('BankChurners.csv')
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 10127 non-null object 6 Marital_Status 10127 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
data.nunique()
CLIENTNUM 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 7 Marital_Status 4 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
data.drop(columns=['CLIENTNUM'], inplace=True)
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# Making a list of all the categorical variables
cat_cols = ['Attrition_Flag','Gender','Education_Level','Marital_Status','Income_Category','Card_Category']
# Printing number count of each unique value
for col in cat_cols:
print(data[col].value_counts())
print('-'*40)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 ---------------------------------------- F 5358 M 4769 Name: Gender, dtype: int64 ---------------------------------------- Graduate 3128 High School 2013 Unknown 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ---------------------------------------- Married 4687 Single 3943 Unknown 749 Divorced 748 Name: Marital_Status, dtype: int64 ---------------------------------------- Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 Unknown 1112 $120K + 727 Name: Income_Category, dtype: int64 ---------------------------------------- Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 ----------------------------------------
# While doing uni-variate analysis of numerical variables, we want to study their central tendency and dispersion.
# Let's write a function that will help us create boxplot and histogram for any numerical input variables.
def histogram_boxplot(feature, figsize=(15,10), bins=None):
"""Boxplot and histogram combined
feature: 1-d feature array
figsize: size of fig (default (9,8))
bins: number of bins (defualt None / auto)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.distplot(
feature, kde=F, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.distplot(
feature, kde=False, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
np.mean(feature), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
np.median(feature), color="black", linestyle="-"
) # Add median to the histogram
data.columns
Index(['Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count',
'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
dtype='object')
# Observations on Customer_Age
histogram_boxplot(data['Customer_Age'])
data[data['Customer_Age'] >= 70]
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 251 | Existing Customer | 73 | M | 0 | High School | Married | $40K - $60K | Blue | 36 | 5 | 3 | 2 | 4469.0 | 1125 | 3344.0 | 1.363 | 1765 | 34 | 1.615 | 0.252 |
| 254 | Existing Customer | 70 | M | 0 | High School | Married | Less than $40K | Blue | 56 | 3 | 2 | 3 | 3252.0 | 1495 | 1757.0 | 0.581 | 1227 | 15 | 0.875 | 0.460 |
# Clipping values for Customer_Age at next highest value 70.
data['Customer_Age'].clip(upper=70, inplace=True)
histogram_boxplot(data['Dependent_count'])
histogram_boxplot(data['Months_on_book'])
histogram_boxplot(data['Total_Relationship_Count'])
histogram_boxplot(data['Months_Inactive_12_mon'])
histogram_boxplot(data['Contacts_Count_12_mon'])
histogram_boxplot(data['Credit_Limit'])
histogram_boxplot(data['Total_Revolving_Bal'])
histogram_boxplot(data['Avg_Open_To_Buy'])
histogram_boxplot(data['Total_Amt_Chng_Q4_Q1'])
# Checking 10 largest values of amount spend on meat products
data.Total_Amt_Chng_Q4_Q1.nlargest(10)
8 2.675 12 2.675 773 2.675 2 2.594 219 2.368 47 2.357 46 2.316 658 2.282 58 2.275 466 2.271 Name: Total_Amt_Chng_Q4_Q1, dtype: float64
# Capping values for Total Amt chng Q4 over Q1 at next highest value i.e. 2.675
data["Total_Amt_Chng_Q4_Q1"].clip(upper= 2.675, inplace=True)
histogram_boxplot(data['Total_Trans_Amt'])
histogram_boxplot(data['Total_Trans_Ct'])
histogram_boxplot(data['Total_Ct_Chng_Q4_Q1'])
data.Total_Ct_Chng_Q4_Q1.nlargest(30)
1 3.714 773 3.571 269 3.500 12 3.250 113 3.000 190 3.000 146 2.875 366 2.750 30 2.571 4 2.500 805 2.500 2510 2.500 158 2.429 68 2.400 280 2.400 2 2.333 3 2.333 167 2.286 239 2.273 757 2.222 1095 2.222 131 2.200 91 2.182 162 2.167 309 2.100 294 2.083 13 2.000 69 2.000 84 2.000 231 2.000 Name: Total_Ct_Chng_Q4_Q1, dtype: float64
# Clipping values for Total count chng Q4 over Q1 at next highest value i.e. 2.0
data["Total_Amt_Chng_Q4_Q1"].clip(upper= 2.000, inplace=True)
histogram_boxplot(data['Avg_Utilization_Ratio'])
def perc_on_bar(feature):
"""
plot
feature: categorical feature
the function won't work if a column is passed in hue parameter
"""
# Creating a countplot for the feature
sns.set(rc={"figure.figsize": (10, 5)})
ax = sns.countplot(x=feature, data=data)
total = len(feature) # length of the column
for p in ax.patches:
percentage = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
x = p.get_x() + p.get_width() / 2 - 0.1 # width of the plot
y = p.get_y() + p.get_height() # hieght of the plot
ax.annotate(percentage, (x, y), size=14) # annotate the percantage
plt.show() # show the plot
# observations on Attrition_Flag
perc_on_bar(data["Attrition_Flag"])
# observations on Gender
perc_on_bar(data["Gender"])
# observations on Education_Level
perc_on_bar(data["Education_Level"])
# observations on Marital_Status
perc_on_bar(data["Marital_Status"])
# observations on Income_Category
perc_on_bar(data["Income_Category"])
# observations on Card_Category
perc_on_bar(data["Card_Category"])
sns.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x7fd05b38fc70>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Avg_Utilization_Ratio", x="Gender", data=data, orient="vertical")
<AxesSubplot:xlabel='Gender', ylabel='Avg_Utilization_Ratio'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(y="Avg_Utilization_Ratio", x="Marital_Status", data=data, orient="vertical")
<AxesSubplot:xlabel='Marital_Status', ylabel='Avg_Utilization_Ratio'>
cols = data[["Total_Trans_Amt","Total_Trans_Ct", "Months_Inactive_12_mon","Months_on_book", "Credit_Limit", "Avg_Utilization_Ratio"]].columns.tolist()
plt.figure(figsize=(10, 10))
for i, variable in enumerate(cols):
plt.subplot(3, 2, i + 1)
sns.boxplot(data["Attrition_Flag"], data[variable])
plt.tight_layout()
plt.title(variable)
plt.show()
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
sns.set(palette="nipy_spectral")
tab1 = pd.crosstab(x, data["Attrition_Flag"], margins=True)
print(tab1)
print("-" * 120)
tab = pd.crosstab(x, data["Attrition_Flag"], normalize="index")
tab.plot(kind="bar", stacked=True, figsize=(10, 5))
plt.legend(loc="lower left", frameon=False)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_plot(data["Marital_Status"])
Attrition_Flag Attrited Customer Existing Customer All Marital_Status Divorced 121 627 748 Married 709 3978 4687 Single 668 3275 3943 Unknown 129 620 749 All 1627 8500 10127 ------------------------------------------------------------------------------------------------------------------------
stacked_plot(data["Income_Category"])
Attrition_Flag Attrited Customer Existing Customer All Income_Category $120K + 126 601 727 $40K - $60K 271 1519 1790 $60K - $80K 189 1213 1402 $80K - $120K 242 1293 1535 Less than $40K 612 2949 3561 Unknown 187 925 1112 All 1627 8500 10127 ------------------------------------------------------------------------------------------------------------------------
stacked_plot(data["Education_Level"])
Attrition_Flag Attrited Customer Existing Customer All Education_Level College 154 859 1013 Doctorate 95 356 451 Graduate 487 2641 3128 High School 306 1707 2013 Post-Graduate 92 424 516 Uneducated 237 1250 1487 Unknown 256 1263 1519 All 1627 8500 10127 ------------------------------------------------------------------------------------------------------------------------
stacked_plot(data["Card_Category"])
Attrition_Flag Attrited Customer Existing Customer All Card_Category Blue 1519 7917 9436 Gold 21 95 116 Platinum 5 15 20 Silver 82 473 555 All 1627 8500 10127 ------------------------------------------------------------------------------------------------------------------------
sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
data.corr(),
annot=True,
linewidths=0.5,
center=0,
cbar=False,
cmap="YlGnBu",
fmt="0.2f",
)
plt.show()
- take care target variable ( convert to 1 and 0)
- drop the columns
- missing treatment
-
data1 = data.copy()
data['Attrition_Flag']
0 Existing Customer
1 Existing Customer
2 Existing Customer
3 Existing Customer
4 Existing Customer
...
10122 Existing Customer
10123 Attrited Customer
10124 Attrited Customer
10125 Attrited Customer
10126 Attrited Customer
Name: Attrition_Flag, Length: 10127, dtype: object
data1.loc[data1['Attrition_Flag'] == 'Attrited Customer', 'Attrition_Flag'] = 1
data1.loc[data1['Attrition_Flag'] == 'Existing Customer', 'Attrition_Flag'] = 0
data1['Attrition_Flag'].value_counts()
0 8500 1 1627 Name: Attrition_Flag, dtype: int64
data1['Gender'] = data1['Gender'].astype('category')
data1['Card_Category'] = data1['Card_Category'].astype('category')
data1['Attrition_Flag'] = data1['Attrition_Flag'].astype('category')
data1.drop(columns=['Avg_Open_To_Buy','Total_Trans_Ct','Customer_Age'], inplace=True)
col_for_impute = ['Education_Level', 'Marital_Status','Income_Category' ]
data1[col_for_impute].head()
| Education_Level | Marital_Status | Income_Category | |
|---|---|---|---|
| 0 | High School | Married | $60K - $80K |
| 1 | Graduate | Single | Less than $40K |
| 2 | Graduate | Married | $80K - $120K |
| 3 | High School | Unknown | Less than $40K |
| 4 | Uneducated | Married | $60K - $80K |
# Convert Unknown values to None, so we can use KNN
data2 = data1.copy()
for col in col_for_impute:
data2.loc[data2[col]=='Unknown',col] = None
data2.isnull().sum()
Attrition_Flag 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
for col in col_for_impute:
print(data2[col].value_counts())
print('#'*30)
Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 ############################## Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 ############################## Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 $120K + 727 Name: Income_Category, dtype: int64 ##############################
# we need to pass numberical values for each categorical columns for KNN imputation so we will label encode them
education_level = {'Graduate':0, 'High School':1, 'Uneducated':2, 'College':3, 'Post-Graduate':4, 'Doctorate':5 }
data2['Education_Level'] = data2['Education_Level'].map(education_level)
marital_status = {'Married':0, 'Single':1, 'Divorced':2}
data2['Marital_Status'] = data2['Marital_Status'].map(marital_status)
income_category = {'$40K - $60K':0, '$80K - $120K':1, '$60K - $80K':2, '$120K +':3}
data2['Income_Category'] = data2['Income_Category'].map(income_category)
data2.head()
| Attrition_Flag | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | M | 3 | 1.0 | 0.0 | 2.0 | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 1.335 | 1144 | 1.625 | 0.061 |
| 1 | 0 | F | 5 | 0.0 | 1.0 | NaN | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 1.541 | 1291 | 3.714 | 0.105 |
| 2 | 0 | M | 3 | 0.0 | 0.0 | 1.0 | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 2.000 | 1887 | 2.333 | 0.000 |
| 3 | 0 | F | 4 | 1.0 | NaN | NaN | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 1.405 | 1171 | 2.333 | 0.760 |
| 4 | 0 | M | 3 | 2.0 | 0.0 | 2.0 | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 2.000 | 816 | 2.500 | 0.000 |
# Separating target variable and other variables
X = data2.drop(columns='Attrition_Flag')
Y = data2['Attrition_Flag']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=7, stratify=Y)
print(X_train.shape, X_test.shape)
(7088, 16) (3039, 16)
imputer = KNNImputer(n_neighbors=5)
#Fit and transform the train data
X_train[col_for_impute]=imputer.fit_transform(X_train[col_for_impute])
#Transform the test data
X_test[col_for_impute]=imputer.transform(X_test[col_for_impute])
# Checking that no column has missing values in the train or test datasets.
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
## Function to inverse the encoding
def inverse_mapping(x,y):
inv_dict = {v: k for k, v in x.items()}
X_train[y] = np.round(X_train[y]).map(inv_dict).astype('category')
X_test[y] = np.round(X_test[y]).map(inv_dict).astype('category')
X_train.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7088 entries, 5672 to 6456 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Gender 7088 non-null category 1 Dependent_count 7088 non-null int64 2 Education_Level 7088 non-null float64 3 Marital_Status 7088 non-null float64 4 Income_Category 7088 non-null float64 5 Card_Category 7088 non-null category 6 Months_on_book 7088 non-null int64 7 Total_Relationship_Count 7088 non-null int64 8 Months_Inactive_12_mon 7088 non-null int64 9 Contacts_Count_12_mon 7088 non-null int64 10 Credit_Limit 7088 non-null float64 11 Total_Revolving_Bal 7088 non-null int64 12 Total_Amt_Chng_Q4_Q1 7088 non-null float64 13 Total_Trans_Amt 7088 non-null int64 14 Total_Ct_Chng_Q4_Q1 7088 non-null float64 15 Avg_Utilization_Ratio 7088 non-null float64 dtypes: category(2), float64(7), int64(7) memory usage: 844.8 KB
inverse_mapping(education_level, 'Education_Level')
inverse_mapping(marital_status, 'Marital_Status')
inverse_mapping(income_category, 'Income_Category')
for col in col_for_impute:
print(X_train[col].value_counts())
print('#'*30)
Graduate 2631 High School 1554 Uneducated 1516 College 713 Post-Graduate 352 Doctorate 322 Name: Education_Level, dtype: int64 ############################## Married 3420 Single 3130 Divorced 538 Name: Marital_Status, dtype: int64 ############################## $80K - $120K 3900 $40K - $60K 1515 $60K - $80K 1169 $120K + 504 Name: Income_Category, dtype: int64 ##############################
for col in col_for_impute:
print(X_test[col].value_counts())
print('#'*30)
Graduate 1098 High School 721 Uneducated 627 College 300 Post-Graduate 164 Doctorate 129 Name: Education_Level, dtype: int64 ############################## Married 1521 Single 1308 Divorced 210 Name: Marital_Status, dtype: int64 ############################## $80K - $120K 1632 $40K - $60K 648 $60K - $80K 536 $120K + 223 Name: Income_Category, dtype: int64 ##############################
X_train.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7088 entries, 5672 to 6456 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Gender 7088 non-null category 1 Dependent_count 7088 non-null int64 2 Education_Level 7088 non-null category 3 Marital_Status 7088 non-null category 4 Income_Category 7088 non-null category 5 Card_Category 7088 non-null category 6 Months_on_book 7088 non-null int64 7 Total_Relationship_Count 7088 non-null int64 8 Months_Inactive_12_mon 7088 non-null int64 9 Contacts_Count_12_mon 7088 non-null int64 10 Credit_Limit 7088 non-null float64 11 Total_Revolving_Bal 7088 non-null int64 12 Total_Amt_Chng_Q4_Q1 7088 non-null float64 13 Total_Trans_Amt 7088 non-null int64 14 Total_Ct_Chng_Q4_Q1 7088 non-null float64 15 Avg_Utilization_Ratio 7088 non-null float64 dtypes: category(5), float64(4), int64(7) memory usage: 699.9 KB
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first =True)
print(X_train.shape, X_test.shape)
(7088, 25) (3039, 25)
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
"""
model : classifier to predict values of X
"""
# defining an empty list to store train and test results
score_list = []
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
train_recall = metrics.recall_score(y_train, pred_train)
test_recall = metrics.recall_score(y_test, pred_test)
train_precision = metrics.precision_score(y_train, pred_train)
test_precision = metrics.precision_score(y_test, pred_test)
score_list.extend(
(
train_acc,
test_acc,
train_recall,
test_recall,
train_precision,
test_precision,
)
)
# If the flag is set to True then only the following print statements will be dispayed. The default value is set to True.
if flag == True:
print("Accuracy on training set : ", model.score(X_train, y_train))
print("Accuracy on test set : ", model.score(X_test, y_test))
print("Recall on training set : ", metrics.recall_score(y_train, pred_train))
print("Recall on test set : ", metrics.recall_score(y_test, pred_test))
print(
"Precision on training set : ", metrics.precision_score(y_train, pred_train)
)
print("Precision on test set : ", metrics.precision_score(y_test, pred_test))
return score_list # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
"""
model : classifier to predict values of X
y_actual : ground truth
"""
y_predict = model.predict(X_test)
cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
df_cm = pd.DataFrame(
cm,
index=[i for i in ["Actual - No", "Actual - Yes"]],
columns=[i for i in ["Predicted - No", "Predicted - Yes"]],
)
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
from sklearn.linear_model import LogisticRegression
lgr_model = LogisticRegression(random_state = 7)
lgr_model.fit(X_train, y_train)
LogisticRegression(random_state=7)
# Calculating different metrics
get_metrics_score(lgr_model)
# Creating confusing matrix
make_confusion_matrix(lgr_model, y_test)
Accuracy on training set : 0.8768340857787811 Accuracy on test set : 0.8739717012175058 Recall on training set : 0.3520632133450395 Recall on test set : 0.3114754098360656 Precision on training set : 0.7481343283582089 Precision on test set : 0.7638190954773869
from imblearn.over_sampling import SMOTE
print('Before UpSampling counts of Label "1" : {}'.format(sum(y_train==1)))
print('Before UpSampling counts of Label "0" : {} \n'.format(sum(y_train==0)))
sm = SMOTE(sampling_strategy=1, k_neighbors =5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print('After UpSampling, counts of label "1" : {}'.format(sum(y_train_over==1)))
print('After UpSampling, counts of label "0" : {} \n'.format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
Before UpSampling counts of Label "1" : 1139 Before UpSampling counts of Label "0" : 5949 After UpSampling, counts of label "1" : 5949 After UpSampling, counts of label "0" : 5949 After UpSampling, the shape of train_X: (11898, 25) After UpSampling, the shape of train_y: (11898,)
lgr_model_over = LogisticRegression(random_state=1)
# Training the basic logistic regression model with training set
lgr_model_over.fit(X_train_over, y_train_over)
LogisticRegression(random_state=1)
# Calculating different metrics
get_metrics_score(lgr_model_over)
# Creating confusing matrix
make_confusion_matrix(lgr_model_over, y_test)
Accuracy on training set : 0.7632618510158014 Accuracy on test set : 0.7489305692662059 Recall on training set : 0.7058823529411765 Recall on test set : 0.6721311475409836 Precision on training set : 0.37447601304145317 Precision on test set : 0.3523093447905478
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label '0': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label '1': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label '0': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Under Sampling, counts of label '1': 1139 Before Under Sampling, counts of label '0': 5949 After Under Sampling, counts of label '1': 1139 After Under Sampling, counts of label '0': 1139 After Under Sampling, the shape of train_X: (2278, 25) After Under Sampling, the shape of train_y: (2278,)
lgr_model_under = LogisticRegression(random_state=5)
lgr_model_under.fit(X_train_un, y_train_un)
LogisticRegression(random_state=5)
# Calculating different metrics
get_metrics_score(lgr_model_under)
# Creating confusing matrix
make_confusion_matrix(lgr_model_under, y_test)
Accuracy on training set : 0.7502821670428894 Accuracy on test set : 0.7416913458374466 Recall on training set : 0.742756804214223 Recall on test set : 0.7233606557377049 Precision on training set : 0.3641842445114077 Precision on test set : 0.35194416749750745
decisionTree_model = DecisionTreeClassifier(criterion='gini', class_weight={0:0.15,1:0.85}, random_state=5)
decisionTree_model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=5)
# Calculating different metrics
get_metrics_score(decisionTree_model)
# Creating confusing matrix
make_confusion_matrix(decisionTree_model, y_test)
Accuracy on training set : 1.0 Accuracy on test set : 0.9134583744652847 Recall on training set : 1.0 Recall on test set : 0.7213114754098361 Precision on training set : 1.0 Precision on test set : 0.7348643006263048
rf_model = RandomForestClassifier(random_state=5, class_weight='balanced')
rf_model.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=5)
# Calculating different metrics
get_metrics_score(rf_model)
# Creating confusing matrix
make_confusion_matrix(rf_model, y_test)
Accuracy on training set : 0.9998589164785553 Accuracy on test set : 0.9387956564659428 Recall on training set : 0.9991220368744512 Recall on test set : 0.6721311475409836 Precision on training set : 1.0 Precision on test set : 0.9265536723163842
from sklearn.ensemble import BaggingClassifier
bagging_model = BaggingClassifier(random_state=5)
bagging_model.fit(X_train, y_train)
BaggingClassifier(random_state=5)
# Calculating different metrics
get_metrics_score(bagging_model)
# Creating confusing matrix
make_confusion_matrix(bagging_model, y_test)
Accuracy on training set : 0.9956264108352144 Accuracy on test set : 0.9430733794011188 Recall on training set : 0.9736611062335382 Recall on test set : 0.7397540983606558 Precision on training set : 0.9990990990990991 Precision on test set : 0.8869778869778869
adb_model = AdaBoostClassifier(random_state=5)
adb_model.fit(X_train, y_train)
AdaBoostClassifier(random_state=5)
# Calculating different metrics
get_metrics_score(adb_model)
# Creating confusing matrix
make_confusion_matrix(adb_model, y_test)
Accuracy on training set : 0.9398984198645598 Accuracy on test set : 0.9262915432708128 Recall on training set : 0.7673397717295873 Recall on test set : 0.6987704918032787 Precision on training set : 0.8444444444444444 Precision on test set : 0.8157894736842105
gbc_model = GradientBoostingClassifier(random_state=5)
gbc_model.fit(X_train, y_train)
GradientBoostingClassifier(random_state=5)
# Calculating different metrics
get_metrics_score(gbc_model)
# Creating confusing matrix
make_confusion_matrix(gbc_model, y_test)
Accuracy on training set : 0.9631772009029346 Accuracy on test set : 0.945705824284304 Recall on training set : 0.8331870061457419 Recall on test set : 0.7438524590163934 Precision on training set : 0.9303921568627451 Precision on test set : 0.9007444168734491
xgb_model = XGBClassifier(random_state=5, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=5, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Calculating different metrics
get_metrics_score(xgb_model)
# Creating confusing matrix
make_confusion_matrix(xgb_model, y_test)
Accuracy on training set : 1.0 Accuracy on test set : 0.9598552155314248 Recall on training set : 1.0 Recall on test set : 0.8217213114754098 Precision on training set : 1.0 Precision on test set : 0.9197247706422018
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'adaboostclassifier__learning_rate': 0.2, 'adaboostclassifier__n_estimators': 100}
Score: 0.8314436973490997
CPU times: user 6min 15s, sys: 3.6 s, total: 6min 18s
Wall time: 6min 23s
# Creating new pipeline with best parameters
adb_turned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
n_estimators = 100,
learning_rate = 0.2,
random_state = 1
)
)
# Fit the model on training data
adb_turned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(adb_turned1)
# Creating confusing matrix
make_confusion_matrix(adb_turned1, y_test)
Accuracy on training set : 0.9830699774266366 Accuracy on test set : 0.9555774925962488 Recall on training set : 0.9306409130816505 Recall on test set : 0.8094262295081968 Precision on training set : 0.9627611262488647 Precision on test set : 0.9038901601830663
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample": [0.8, 0.9, 1],
"gradientboostingclassifier__max_features": [0.7,0.8,0.9,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'gradientboostingclassifier__max_features': 0.7, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__subsample': 1}
Score: 0.8314205116315015
CPU times: user 5min 28s, sys: 2.27 s, total: 5min 31s
Wall time: 5min 32s
# Creating new pipeline with best parameters
gdb_tuned1 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1),
random_state=1,
max_features=0.7,
n_estimators=250,
subsample=1
)
)
# Fit the model on training data
gdb_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=250,
random_state=1, subsample=1))])
# Calculating different metrics
get_metrics_score(gdb_tuned1)
# Creating confusing matrix
make_confusion_matrix(gdb_tuned1, y_test)
Accuracy on training set : 0.9810948081264108 Accuracy on test set : 0.9542612701546561 Recall on training set : 0.9218612818261633 Recall on test set : 0.7930327868852459 Precision on training set : 0.958904109589041 Precision on test set : 0.9105882352941177
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), BaggingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"baggingclassifier__base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1)],
"baggingclassifier__n_estimators": [5,7,15,51,101],
"baggingclassifier__max_features": [0.7,0.8,0.9,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'baggingclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1), 'baggingclassifier__max_features': 0.9, 'baggingclassifier__n_estimators': 5}
Score: 0.5329584975654996
CPU times: user 48.2 s, sys: 739 ms, total: 48.9 s
Wall time: 49.3 s
# Creating new pipeline with best parameters
bagging_tuned1 = make_pipeline(
StandardScaler(),
BaggingClassifier(
base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
random_state=1,
max_features=0.9,
n_estimators=5,
)
)
# Fit the model on training data
bagging_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
max_features=0.9, n_estimators=5,
random_state=1))])
# Calculating different metrics
get_metrics_score(bagging_tuned1)
# Creating confusing matrix
make_confusion_matrix(bagging_tuned1, y_test)
Accuracy on training set : 0.8972911963882618 Accuracy on test set : 0.8933859822309971 Recall on training set : 0.5267778753292361 Recall on test set : 0.4918032786885246 Precision on training set : 0.7604562737642585 Precision on test set : 0.759493670886076
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
rdm_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
rdm_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(rdm_cv.best_params_,rdm_cv.best_score_))
Best parameters are {'adaboostclassifier__n_estimators': 90, 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8314205116315015:
CPU times: user 2min 5s, sys: 1.02 s, total: 2min 6s
Wall time: 2min 6s
# Creating new pipeline with best parameters from RandomizedSearch CV
adb_turned2 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator = DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators = 90,
learning_rate = 1,
random_state = 1
)
)
# Fit the model on training data
adb_turned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=90,
random_state=1))])
# Calculating different metrics
get_metrics_score(adb_turned2)
# Creating confusing matrix
make_confusion_matrix(adb_turned2, y_test)
Accuracy on training set : 0.9841986455981941 Accuracy on test set : 0.9450477130635078 Recall on training set : 0.9411764705882353 Recall on test set : 0.7848360655737705 Precision on training set : 0.9597135183527306 Precision on test set : 0.8606741573033708
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample": [0.8, 0.9, 1],
"gradientboostingclassifier__max_features": [0.7,0.8,0.9,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
rdm_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
# Fitting parameters in GridSeachCV
rdm_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(rdm_cv.best_params_, rdm_cv.best_score_)
)
Best Parameters:{'gradientboostingclassifier__subsample': 1, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__max_features': 0.7}
Score: 0.8314205116315015
CPU times: user 5min 24s, sys: 1.81 s, total: 5min 25s
Wall time: 5min 26s
# Creating new pipeline with best parameters from RandomizedSearch CV
gdb_turned2 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
init=AdaBoostClassifier(random_state=1),
random_state=1,
n_estimators = 250,
subsample = 1,
max_features = 0.7
)
)
# Fit the model on training data
gdb_turned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=250,
random_state=1, subsample=1))])
# Calculating different metrics
get_metrics_score(gdb_turned2)
# Creating confusing matrix
make_confusion_matrix(gdb_turned2, y_test)
Accuracy on training set : 0.9810948081264108 Accuracy on test set : 0.9542612701546561 Recall on training set : 0.9218612818261633 Recall on test set : 0.7930327868852459 Precision on training set : 0.958904109589041 Precision on test set : 0.9105882352941177
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), BaggingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"baggingclassifier__base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1)],
"baggingclassifier__n_estimators": [5,7,15,51,101],
"baggingclassifier__max_features": [0.7,0.8,0.9,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
rdm_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50,scoring=scorer, cv=5, random_state=1)
# Fitting parameters in GridSeachCV
rdm_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(rdm_cv.best_params_, rdm_cv.best_score_)
)
Best Parameters:{'baggingclassifier__n_estimators': 5, 'baggingclassifier__max_features': 0.9, 'baggingclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Score: 0.5329584975654996
CPU times: user 40.4 s, sys: 391 ms, total: 40.8 s
Wall time: 40.8 s
# Creating new pipeline with best parameters
bagging_tuned2 = make_pipeline(
StandardScaler(),
BaggingClassifier(
base_estimator = DecisionTreeClassifier(max_depth=3, random_state=1),
random_state=1,
max_features=0.9,
n_estimators=5,
)
)
# Fit the model on training data
bagging_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('baggingclassifier',
BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
max_features=0.9, n_estimators=5,
random_state=1))])
# Calculating different metrics
get_metrics_score(bagging_tuned2)
# Creating confusing matrix
make_confusion_matrix(bagging_tuned2, y_test)
Accuracy on training set : 0.8972911963882618 Accuracy on test set : 0.8933859822309971 Recall on training set : 0.5267778753292361 Recall on test set : 0.4918032786885246 Precision on training set : 0.7604562737642585 Precision on test set : 0.759493670886076
# defining list of models
models = [decisionTree_model,
rf_model,
bagging_model,
adb_model,
gbc_model,
xgb_model,
adb_turned1,
gdb_tuned1,
bagging_tuned1,
adb_turned2,
gdb_turned2,
bagging_tuned2]
# defining empty lists to add train and test results.
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model, False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
comparison_frame = pd.DataFrame(
{
"Model": [
"Decision Tree",
"Random Forest",
"Bagging",
"AdaBoost",
"Gradient Boost",
"XGBoost",
"GridSearchCV"
"AdaBoost with GridSearchCV",
"GradientBoost with GridSearchCV",
"Bagging with GridSearchCV",
"AdaBoost with RandomizedSearchCV",
"GradientBoost with RandomizedSearchCV",
"Bagging with RandomizedSearchCV",
],
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 5 | XGBoost | 1.000000 | 0.959855 | 1.000000 | 0.821721 | 1.000000 | 0.919725 |
| 6 | GridSearchCVAdaBoost with GridSearchCV | 0.983070 | 0.955577 | 0.930641 | 0.809426 | 0.962761 | 0.903890 |
| 7 | GradientBoost with GridSearchCV | 0.981095 | 0.954261 | 0.921861 | 0.793033 | 0.958904 | 0.910588 |
| 10 | GradientBoost with RandomizedSearchCV | 0.981095 | 0.954261 | 0.921861 | 0.793033 | 0.958904 | 0.910588 |
| 9 | AdaBoost with RandomizedSearchCV | 0.984199 | 0.945048 | 0.941176 | 0.784836 | 0.959714 | 0.860674 |
| 4 | Gradient Boost | 0.963177 | 0.945706 | 0.833187 | 0.743852 | 0.930392 | 0.900744 |
| 2 | Bagging | 0.995626 | 0.943073 | 0.973661 | 0.739754 | 0.999099 | 0.886978 |
| 0 | Decision Tree | 1.000000 | 0.913458 | 1.000000 | 0.721311 | 1.000000 | 0.734864 |
| 3 | AdaBoost | 0.939898 | 0.926292 | 0.767340 | 0.698770 | 0.844444 | 0.815789 |
| 1 | Random Forest | 0.999859 | 0.938796 | 0.999122 | 0.672131 | 1.000000 | 0.926554 |
| 8 | Bagging with GridSearchCV | 0.897291 | 0.893386 | 0.526778 | 0.491803 | 0.760456 | 0.759494 |
| 11 | Bagging with RandomizedSearchCV | 0.897291 | 0.893386 | 0.526778 | 0.491803 | 0.760456 | 0.759494 |
feature_names = X_train.columns
importances = adb_turned1[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>